Auto-Relevancy and Responsiveness Baseline II Improving Concept Search to Establish a Subset with Maximized Recall for Automated First Pass and Early Assessment Using Latent Semantic Indexing [LSI], Bigrams and WordNet 3.0 Seeding

Author

  • Cody Bennett
Abstract

We experiment with manipulating features at build time by indexing bigrams created from EDRM data and seeding the LSI index with thesaurus-like WordNet 3.0 strata. In our experiments, this produces fewer false positives and a smaller, more focused relevant set. The method allows concept searching over bigrams and WordNet senses in addition to single terms, improving the handling of polysemy and precision; a step toward unifying semantic and statistical approaches. Because of the combination of LSI and WordNet senses, word-sense disambiguation (WSD) also appears enhanced. We then apply an automated method for selecting search criteria, query expansion, and concept searching from the Reviewer Guidelines and the original Request for Production, thereby returning a search result with scores across the Enron corpus for each topic. The normalized cosine distance score for each document in each topic is then shifted based on the foundation of primes, the gold standard, and the golden ratio. This yields a 'best cutoff' using naturally occurring patterns in the probability of expected relevancy, with the limit approaching 0.5. Submissions A1, A2, A3, and AF include similar combinations of the above. Although we did not submit a mopup run, we analyzed the mopups for post assessment. For each of the three topics, there were documents which Topic Authorities (TAs) selected as relevant in contention with their other personal assessments. The defect percentage and potential impact on a semi-automated or automated system are also examined. Overall, the influence of the humans involved (TAs) was minimal, as their assessments were not allowed to modify any rank or probability of documents. However, the identification of relevant documents by TAs at low LSI thresholds provided a feedback loop affecting the natural cutoff. Cutoffs for A1, A2, and A3 were nearly -0.04 (Landau) against the Golden and Poisson means, and AF was nearly +0.04 (Apéry).
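The scoring described above can be sketched in a few lines: compute a cosine score per document against a topic vector, normalize scores per topic, and cut at 0.5 shifted by a small constant. This is a minimal illustration only; the vector construction, the derivation of the shift from primes and the golden ratio, and the feedback loop from TA assessments are not reproduced here, and the function names and the min-max normalization are our assumptions.

```python
from math import sqrt

def cosine(u, v):
    # Cosine similarity between two equal-length term-weight vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def normalize(scores):
    # Min-max normalize raw scores into [0, 1] per topic (an assumed
    # normalization; the paper only says scores are "normalized").
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def select_responsive(doc_vectors, topic_vector, shift=-0.04):
    # Natural cutoff near 0.5, shifted by a small constant (the paper
    # reports shifts of roughly -0.04 for A1-A3 and +0.04 for AF).
    raw = [cosine(d, topic_vector) for d in doc_vectors]
    norm = normalize(raw)
    cutoff = 0.5 + shift
    return [i for i, s in enumerate(norm) if s >= cutoff]
```

With toy vectors, `select_responsive([[1, 0, 2], [0, 1, 0], [2, 1, 1]], [1, 0, 1])` keeps documents 0 and 2, whose normalized scores clear the shifted cutoff, and drops document 1, which shares no terms with the topic.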
Since more work is required to decrease false positives, it is encouraging to find a natural relevancy cutoff that maximizes the probable Recall of Responsiveness across differing topics. Automated concept search using a mechanically generated, semantically derived feature set over indexed bigram and WordNet sense terms in an LSI framework reduces false positives and produces a tighter cluster of potentially responsive documents. Further, since legal Productions are essentially binary (Responsive/Non-Responsive), work was done to argue for scoring that supports this view. Obtaining Recall >= 90% and Precision >= 90% with a high degree of success is a two-step process, of which we test and discuss the first step (maximization of Recall) in this study. Therefore, our focus is heavily skewed toward the probability of attaining high Recall for the creation of a subset of the corpus.[1]

Main Experiment Methods

See the TREC website for details on the mock Requests for Production, the per-topic Reviewer Guidelines, and other information regarding scoring and assessing. Team TCDI's participation is discussed here without repeating most of that information.

Baseline Participation

TCDI's baseline submissions assume that by building a blind automated mechanism, the result is a distribution useful as a statistical snapshot, as part of a knowledge and/or eDiscovery paradigm, and/or for ongoing quality assurance and control within large datasets and topic training strata.

[1] During initial data assessment, automated maximization of Recall should be of highest value, since the Recall will carry over to human-assisted systems such as Technology Assisted Review and/or other search methodologies whose focus is to maximize Precision. In tandem, the approach gives a higher probability of attaining maximum Precision/Recall, and uses hybridization techniques allowing for semi-automated and automated capabilities.
Further, corporations' currently deployed Information Management architectures can offer hidden insights into relevancy when historically divergent systems are hybridized.[2] For TREC Legal Track 2011, TCDI's baseline submission considers a hybridization of NLP, Semantic, and LSI systems.[3] Four of five possible runs were submitted; we did not submit a "mopup" run. For runs A1, A2, and A3, some keyword filtering was tested. The final run, AF, used no keyword filtering. Multiple side experiments were performed, some of which are discussed further below. Steps for running the main experiment are listed below.

Feature Build for Indexing [STEP 0]

Baselines were submitted to TREC Legal using:
  • 685,592 de-duplicated Enron emails and attachments, conceptually indexed[4]
  • Additional features per document beyond unigrams:
    o bigrams produced by a simple algorithm
    o a small set of randomly selected WordNet 3.0 senses
  • 3 topics

Data inputs were the mock Requests for Production, the Reviewer Guidelines, and phone conversations. Similar to some Web methods, the verbiage within the legal documents and discussions was expanded upon using a mixture of Natural Language Processing, WordNet sense non-linear distance, and LSI.

[2] Keyword vs. concept, concept vs. probabilistic, concept vs. semantic, etc. Especially with IR systems, hybridization offers revitalization and ROI longevity.
[3] The semantic and conceptual systems could be considered plug-and-play for different approaches. The approach is considered modular as long as a topic model is available and exemplar data is available specifying relevant and non-relevant information.
[4] ContentAnalyst
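The feature build in STEP 0 can be sketched as follows: tokenize each document, emit unigrams plus adjacent-pair bigrams, and seed a small random sample of senses for terms with WordNet entries. The sense table below is a hand-stubbed stand-in for WordNet 3.0 (the real build would query the WordNet database), and the tokenizer, naming scheme, and sampling details are our assumptions, since the paper only says the bigram algorithm is "simple" and the senses "randomly selected".

```python
import random
import re

# Hypothetical stand-in for a WordNet 3.0 sense lookup (sample entries).
SENSES = {
    "energy": ["energy.n.01", "energy.n.02"],
    "trade": ["trade.n.01", "trade.v.01"],
}

def tokenize(text):
    # Lowercase word tokenizer (an assumed tokenization).
    return re.findall(r"[a-z0-9']+", text.lower())

def build_features(text, n_senses=1, seed=0):
    # STEP 0 sketch: unigrams, adjacent-pair bigrams, plus a small
    # random sample of senses for terms that have WordNet entries.
    rng = random.Random(seed)
    toks = tokenize(text)
    features = list(toks)
    features += [f"{a}_{b}" for a, b in zip(toks, toks[1:])]
    for t in toks:
        senses = SENSES.get(t, [])
        if senses:
            features += rng.sample(senses, min(n_senses, len(senses)))
    return features
```

For the text "Energy trade deals", this yields the three unigrams, the bigrams `energy_trade` and `trade_deals`, and one sampled sense each for "energy" and "trade"; the enriched feature list is what would then be handed to the LSI indexer.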




Publication date: 2011